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ABSTRACT 

Kelly (2007, hereafter K07) described an efficient algorithm, using Gibbs sampling, 
for performing linear regression in the fairly general case where non-zero measurement 
errors exist for both the covariates and response variables, where these measurements 
may be correlated (for the same data point), where the response variable is affected 
by intrinsic scatter in addition to measurement error, and where the prior distribution 
of covariates is modeled by a flexible mixture of Gaussians rather than assumed to 
be uniform. Here I extend the K07 algorithm in two ways. First, the procedure is 
generalized to the case of multiple response variables. Second, I describe how to model 
the prior distribution of covariates using a Dirichlet process, which can be thought of as 
a Gaussian mixture where the number of mixture components is learned from the data. 
I present an example of multivariate regression using the extended algorithm, namely 
fitting scaling relations of the gas mass, temperature, and luminosity of dynamically 
relaxed galaxy clusters as a function of their mass and redshift. An implementation 
of the Gibbs sampler in the R language, called lrgs, is provided. 
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1 INTRODUCTION 

Linear regression is perhaps the most widely used example 
of parameter fitting throughout the sciences. Yet, the tradi¬ 
tional ordinary least-squares (or weighted least-squares) ap¬ 
proach to regression neglects some features that are practi¬ 
cally ubiquitous in astrophysical data, namely the existence 
of measurement errors, often correlated with one another, 
on all quantities of interest, and the presence of residual, 
intrinsic scatter (i.e. physical scatter, not the result of mea¬ 
surement errors) about the best fit. K07 takes on this prob¬ 
lem (see that work for a more extensive overview of the 
prior literature) by devising an efficient algorithm for simul¬ 
taneously constraining the parameters of a linear model and 
the intrinsic scatter in the presence of such heteroscedas- 
tic and correlated measurement errors. In addition, the K07 
approach corrects a bias that exists when the underlying 
distribution of covariates in a regression is assumed to be 
uniform, by modeling this distribution as a flexible mixture 
of Gaussian (normal) distributions and marginalizing over 
it. 

The K07 model is considerably more complex, in terms 
of the number of free parameters, than traditional regres- 
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sion. Nevertheless, it can be efficiently constrained using a 
fully conjugate Gibbs sampler, as described in that work. 
Briefly, the approach takes advantage of the fact that, for a 
suitable model, the fully conditional posterior of certain pa¬ 
rameters (or blocks of parameters)^ may be expressible as a 
known distribution which can be sampled from directly us¬ 
ing standard numerical techniques, ff all model parameters 
can be sampled this way, then a Gibbs sampler, which sim¬ 
ply cycles through the list of parameters, updating or block¬ 
updating them in turn, can move efficiently through the 
parameter space. By repeatedly Gibbs sampling, a Markov 
chain that converges to the joint posterior distribution of all 
model parameters is generated (see, e.g., Gelman et al. 2004 
for theoretical background). The individual pieces (e.g., the 
model distributions of measurement error, intrinsic scatter, 
and the covariate prior distribution) of the K07 model are 
conjugate, making it suitable for this type of efficient Gibbs 
sampling. This is a key advantage in terms of making the re¬ 
sulting algorithm widely accessible to the community, since 
conjugate Gibbs samplers, unlike more general and power¬ 
ful Markov Chain Monte Garlo samplers, require no a priori 
tuning by the user. 

^ i.e. the posterior distribution for certain parameters conditional 
on the (fixed) values of all other parameters. 
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While K07 argue against the assumption of a uniform 
prior for covariates, it should be noted that the alternative 
of a Gaussian mixture model (or the Dirichlet process gen¬ 
eralization introduced below) is not necessarily applicable 
in every situation either. When a well motivated physical 
model of the distribution of covariates exists, it may well 
be preferable to use it, even at the expense of computa¬ 
tional efficiency. In the general case, we can hope that a 
flexible parametrization like the Gaussian mixture is ade¬ 
quate, although it is always worth checking a posteriori that 
the model distribution of covariates provides a good descrip¬ 
tion of the data. K07 and Sereno & Ettori (2015) discuss real 
applications in which a Gaussian distribution of covariates 
turns out to be adequate, despite the underlying physics 
being non-Gaussian. 

This work describes two useful generalizations to the 
K07 algorithm. First, the number of response variables is 
allowed to be greater than one. Second, the prior distribu¬ 
tion of covariates may be modeled using a Dirichlet process 
rather than as a mixture of Gaussians with a fixed number of 
components. A Dirichlet process describes a probability dis¬ 
tribution over the space of probability distributions, and (in 
contrast to the many parameters required to specify a large 
mixing model) is described only by a concentration parame¬ 
ter and a base distribution. For the choice of a Gaussian base 
distribution, used here, the Dirichlet process can be thought 
of as a Gaussian mixture in which the number of mixture 
components is learned from the data and marginalized over 
as the fit progresses (see more discussion, in a different astro- 
physical context, by Schneider et al. 2015). This makes it a 
very general and powerful alternative to the standard fixed- 
size Gaussian mixture, as well as one that requires even less 
tuning by the user, since the number of mixture components 
need not be specified. Crucially, both of these generaliza¬ 
tions preserve the conjugacy of the model, so that posterior 
samples can still be easily obtained by Gibbs sampling. 

Of course, K07 (or this paper) does not provide the 
only implementation of conjugate Gibbs sampling, nor is 
that approach the only one possible for linear regression in 
the Bayesian context. Indeed, there exist more general sta¬ 
tistical packages capable of identifying conjugate sampling 
strategies (where possible) based on an abstract model defi¬ 
nition (e.g., BUGS,^ JAGS,^ and STAn"^). The use of more gen¬ 
eral Markov chain sampling techniques naturally allow for 
more general (non-conjugate) models and/or parametriza- 
tions (e.g., Maughan 2014; Robotham & Obreschkow 2015). 
Nevertheless, there is something appealing in the relative 
simplicity of implementation and use of the conjugate Gibbs 
approach, particularly as it applies so readily to the com¬ 
monly used linear model with Gaussian scatter. 

Section 2 describes the model employed in this work in 
more detail, and introduces notation. Section 3 outlines the 
changes to the K07 sampling algorithm needed to accomo¬ 
date the generalizations above. Since this work is intended 
to extend that of K07, I confine this discussion only to steps 
which differ from the that algorithm, and do not review the 
Gibbs sampling procedure in its entirety. However, the level 


^ http://openbugs.net/w/FrontPage 
® http://mcmc-jags.sourceforge.net/ 
http://inc-stan.org/ 


of detail is intentionally high; between this document and 
K07, it should be straightforward for the interested reader 
to create his or her own implementation of the entire algo¬ 
rithm. Section 4 provides some example analyses, including 
one with real astrophysical data, and discusses some practi¬ 
cal aspects of the approach. 

The complete algorithm described here (with both 
Gaussian mixture and Dirichlet process models) has been 
implemented in the R language.® The package is named Lin¬ 
ear Regression by Gibbs Sampling (lrgs), the better to sow 
confusion among extragalactic astronomers. The code can 
be obtained from GitHub® or the Gomprehensive R Archive 
Network.’^ 


2 MODEL AND NOTATION 

Here I review the model described by K07, introducing the 
generalization to multiple response variables (Section 2.1) 
and the use of the Dirichlet process to describe the prior dis¬ 
tribution of the covariates (Section 2.2). The notation used 
here is summarized in Table 1; it differs slightly from that of 
K07, as noted. In this document, A ^ B denotes a stochas¬ 
tic relationship in which a random variable A is drawn from 
the probability distribution B, and boldface distinguishes 
vector- or matrix-valued variables. 


2.1 Gaussian mixture model 


We are interested in p + m properties of some class of object, 
where p of these (covariates) are supposed to be physically 
responsible for determining the other m (response variables). 
Measurements of these p -|- m quantities have been gathered 
for n objects. The true values of the covariates for the ith 
data point are denoted ^i, and the corresponding true re¬ 
sponses are denoted rji; these are nuisance parameters that 
will be marginalized over. The measured values of the corre¬ 
sponding quantities are denoted Xi and j/i, and are assumed 
to be related to the true values by a (p-l-m)-dimensional nor¬ 
mal measurement error distribution, which may be different 
for each data point. Writing the i/-dimensional normal dis¬ 
tribution with mean /x and covariance V as A//(/x, V), this 
is (for the ith data point)® 



' A) 


p-i- m 


»7i 



( 1 ) 


The p-dimensional distribution of covariates that these 
objects originally come from is not necessarily uniform. It 
is therefore modeled in a flexible way, as a mixture of K 
p-dimensional normal distributions, 


K 

^ ^ TT/c A/p T/c) , 

k=l 


( 2 ) 


^ http://www.r-project.org 
® https://github.coin/abinantz/lrgs 
^ http://crcLn.r-project.org 

® Note that in K07 y preceded x as they correspond to rows and 
columns of M. The reverse convention is followed here. 
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Table 1. Summary of notation used in this work. Where this departs from the notation used by K07, the K07 equivalent is noted in the 
last column. 



Symbol 

Meaning 

K07 

General 

A7,(/x,S) 

ix-dimensional normal distribution (mean /x? covariance S) 


notation 

>V(V,y) 

Wishart distribution (scale matrix V, u degrees of freedom) 



Aij 

single element of matrix A 




jth row or column of A 



^3-'' ^-3 

A with the jth row or column removed 

1 

1 


In 

n X n identity matrix 


Gommon 

n 

number of data points 


parameters 

p 

number of covariates 



m 

number of responses 

1 


K 

number of Gaussian mixture components or clusters 



Xi , yi 

measured covariates and responses for data point i 



Mi 

measurement covariance matrix for data point i 

Si 



true covariates and responses for data point i 



Q. 

intercepts of the linear model 



13 

slopes of the linear model 



S 

intrinsic covariance about the linear model 

^2 


G 

mixture component/cluster identification for each data point 


Gaussian 

TT 

weight of the each mixture component 


mixture 

P-fc 

mean of the kth component 



Tk 

covariance of the kth component 



MO 

mean of the prior distribution of each /Xfc 



u 

covariance of the prior distribution of each /Xfc 



w 

scale matrix of the prior distribution of each Tfc 


Dirichlet 

n 

mean of the normal base distribution 


process 

T 

covariance of the normal base distribution 



K 

concentration parameter of the process 



a 

shape parameter of the prior of k. 



b 

rate parameter of the prior of k, 



with TTfc = 1. The summation notation in Equation 2 
is meant to convey that is drawn from the kth normal 
distribution (which has mean fj,k and covariance Tfc) with 
probability ivk- As in K07, this is implemented by means of a 
set of latent indicator variables, G, with Gi indicating which 
of the K mixture components is drawn from.® Formally, 
each G follows the multinomial distribution defined by the 
proportions n. 

The parameters G, tt, {/Xfc} and {Tk} can be learned 
from the data, but it is helpful to impose some structure on 
them. Therefore, we adopt a hierarchical model whereby the 
vectors {fJ.k} themselves follow a normal distribution, 

/Xfc ~ A/p (/^o, Ef), (3) 

and both U and the covariances {Tk} follow inverse-Wishart 
distributions, 

U ~ W-^iW,K+p), (4) 

Tfc ~ W-^ {W, K + p). 

Here iV,v) denotes the inverse-Wishart distribution 

with scale matrix V and u degrees of freedom.^® I follow 

^ In the notation used here, Gi is simply a label 1,2,... 
whereas K07 describe each Gi as a vector with all but one el¬ 
ement zero. This distinction makes no practical difference. 

While the inverse-Wishart distribution is conjugate in this 
context, and therefore computationally convenient, it has the 


K07 in taking uniform priors on the hyperparameters 
and Note that this hierarchical model, and the Gaus¬ 

sian mixture itself, is fairly flexible but not fully general. 
While these are typically reasonable assumptions when we 
have little prior information about the distribution of co¬ 
variates, they may not be appropriate for all situations. The 
particular structure in Equations 3-4 tends to promote com¬ 
pactness in the covariate distribution; that is, if multiple, 

generic disadvantage of imposing a particular structure on the 
marginal prior distributions of the variances and correlation co¬ 
efficients that make up the resulting covariance matrix. In par¬ 
ticular, large absolute values of the correlation coefficients pref¬ 
erentially correspond to large variances. This is sometimes unde¬ 
sirable, and from a practical standpoint it can occasionally result 
in the generation of computationally singular matrices. An alter¬ 
native approach is to decompose a given covariance matrix. A, as 
SRS, where S is diagonal and H is a correlation matrix. This 
allows independent priors to be adopted for individual variances 
and correlation coefficients, which is usually more intuitive than 
using inverse-Wishart distribution, but at the expense that the 
resulting model is no longer conjugate. More extensive discussion 
of these options can be found in Barnard et al. (2000), Gelman 
et al. (2004), and O’Malley &; Zaslavsky (2008). 

The conjugate prior for /xq is normal, Mp{u, V), which is uni¬ 
form in the limit = 0. For the VE, the conjugate prior is 

inverse-Wishart, W~^(^,za), and the equivalent of the uniform 
distribution is realized by taking ^ = 0 and u = —{1 -\- d), where 
d is the size of ^ (in this case, d = p). 


© 0000 RAS, MNRAS 000, 000-000 








4 A. B. Mantz 


well separated clusters of covariates exist, the onus is on the 
data to show that they are required. 

The relationship by which the p covariates determine 
the m responses is assumed to be linear, with a normal in¬ 
trinsic scatter, 

A/m (a +S). (5) 

Note that the linearity of the mean is crucial to maintaining 
the conjugacy of the model. Here ol is the m x 1 vector 
of intercepts, /3 is an m x p matrix of slopes linking each 
response variable with each of the covariates, and S is the 
m X m intrinsic covariance matrix (assumed to be constant 
with respect to ^). This can be written compactly for the 
entire data set in matrix form, with the definitions 

= ( 1 , ( 6 ) 

Yi. = rjJ, 

B = 

where Y is nxm, X is nx (p-|-1), and B is (p-b 1) x m. The 
notation Ai. refers to the ith row of A\ likewise A.j would 
refer to the jth column. The statement of the linear model 
then takes the familiar form 

Y = XB + E, (7) 

Ei. ^ Afm (0, S). 

As noted in Sections 3.1.3 and 3.1.4, the conjugate prior dis¬ 
tributions for B and S are, respectively, normal and inverse- 
Wishart. 


2.2 Dirichlet process model 

The Gaussian mixture prior on the distribution of covari¬ 
ates is flexible, but requires us to either chose a number 
of mixture components outright or carefully check that the 
sensitivity of results to the number of components. Alterna¬ 
tively, we can constrain the distribution of covariates using a 
Dirichlet process, which describes a probability distribution 
over probability distributions. A Dirichlet process is defined 
by a concentration parameter, k, and a base distribution, Pq. 
By choosing Pq to be p-dimensional normal, the conjugacy 
relations that made the Gaussian mixture efficient to Gibbs 
sample will also hold for the Dirichlet process. Used in this 
way, the Dirichlet process can be thought of as a Gaussian 
mixture in which the number of components is marginal¬ 
ized over (Neal 2000 and references therein). The analog of 
Equation 2 is generally written 

ii p, ( 8 ) 

P - DP(Po,«), 

Po = Afp{n,T). 

Here /x and T are the hyperparameters of the base distri¬ 
bution, for which I assume uniform priors. Note that there 
is only one /x and one T, unlike in the Gaussian mixture 
model, and there is no analog of /xo, Lf or W. The remain¬ 
der of the model, namely Equations 1 and 5-7, is the same 
as above. 

In practice, for a given realization of the model parame¬ 
ters, the algorithm for realizing the Dirichlet process divides 
the data set into a finite number of clusters, with points in 


each cluster having identical values of The vector of la¬ 
bels G will identify which cluster each data point belongs 
to, similarly to its use in Section 2.1. A vector of cluster 
proportions, tt, could also be defined analogously. However, 
in practice, the procedure for Gibbs sampling the Dirichlet 
process model implicitly marginalizes over it, and so tt never 
explicitly appears in the calculations (Section 3.2). 

The concentration parameter of the Dirichlet process is 
related to the number of clusters in the data set, and can 
also be marginalized over. The conjugate prior for k is the 
Gamma distribution, 

K ~ Gamma(a, b), (9) 

where a and b are respectively the shape and rate parame¬ 
ters of the prior. If the approximate number of clusters in 
the data set is known, these parameters can be chosen ac¬ 
cordingly; otherwise, they can be chosen to be minimally 
informative (see discussion by Dorazio 2009 and Murugiah 
& Sweeting 2012, and Section 4.1). 


3 THE GIBBS SAMPLER 

Both of the models described above can be efficiently Gibbs 
sampled because they are fully conjugate. Recall that, in this 
situation, the sampling algorithm can be entirely specified 
as the set of conditional distributions used to sequentially 
update each parameter or block of parameters. Section 3.1 
summarizes the changes to the K07 procedure needed to 
sample the Gaussian mixture model when there are multiple 
response variables, and Section 3.2 describes the procedure 
for sampling the Dirichlet process model. 

In either case, an initial guess is needed for most of the 
free parameters, but this need not be very sophisticated. For 
example, it is generally acceptable to begin with the values 
of and {rji} respectively initialized to the measured 
values {xi} and {yi}, the intercepts a set to the average 
value of each column of Y, and the slopes /3 set to zero}^ 
Of course, more intelligent guesses will decrease the “burn- 
in” time of the resulting Markov chain, but generally this is 
relatively short. 

3.1 Sampling the Gaussian mixture model 

The procedure for updating the parameters governing the 
distribution of covariates (G, tt, {fJ-k}, {T^}, /xo, U and 
W, for which I adopt the same priors as K07) is not affected 
by the generalization to multiple responses, and the reader 
is referred to K07 for the details of those updates. Here, I 
review the procedure for Gibbs sampling the true values of 
the covariates and responses for each data point, {^i} and 

A pitfall of this approach occurs when few of the measured co¬ 
variates are consistent with any others within their measurement 
errors. In that case, the number of clusters is necessarily similar 
in size to the number of data points, which is not generally the 
desired result. 

Specifically, this simpleminded guess for the intercepts and 
slopes works reasonably well when the covariates have been ap¬ 
proximately centered. More generally, estimates from an ordinary 
least-squares regression should provide a good starting point. 
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{rji}, the regression coefficients, a and (3, and the intrinsic 
covariance matrix, S. 


3.1.1 Updating the covariates 


See Equations 59-65 of K07 for the corresponding discus¬ 
sion in that work. The fully conditional posterior of the jth 
covariate {j — 1,2,... ,p) for data point i is 



^ A/l I 

(10) 

where 




2 

[{Mi 

-An+iTGAjf+Ro^Y-^Ro] \ 

(11) 

^(ij) ~ 

2 

^(ij) 

(Mi-^4). + (TcrV:). 





Here j indicates removal of the jth entry or column, and z* 
and fj,* are defined as in K07, 


(4)^ 

_ f {xi)e, t = j 

i {xi, yi)e — {^i, nAty t j 

(12) 


_ f (nGi)e, £ = j 

\ (MGi)r - (^i)t, 


3.1.2 

Updating the responses 



The response variables, {rji}, can be updated by modifying 
Equations 69-72 of K07 as follows (for each j = 1,2,..., m)\ 


inAA--- 

~ A/l > 


(13) 

2 

= 

\p+3){p+3)~^^^L 

-1 

5 


303) 

= sip [(Mrkr),+, + (s- 

-1 



Here <^* is defined analogously z* in Section 3.1.1, 


ic: 


{Vi)e, i = P + j 

{xi,Vi)e - t^p + j 


and 

(qD, 


ae+l3i.^i, i = j 
ai+- inAt, tAi ' 


(14) 


(15) 


3.1.3 Updating the coefficients 


The coefficients, ol and (3, may be updated by recasting 
Equation 7 in the form of a univariate regression, 

Y = XB + E, (16) 


where Y and E are nm x 1, X is nm x (p -\- l)m and B is 
(p-\- l)m X 1. I use the following (non-unique) definitions: 


Y = 


B = 


\YmJ 

f B.i \ 

B.-m / 


(17) 


X = 


( X 0 ■■■ 

0 X 

\\ 


with the nm x nm scatter covariance being 



( InSll 

lnSl2 

In^lm \ 


lnS21 

lnS22 

• 172^2771 

S = 

V In^ml 

1 71^7712 

’ In^mm / 


(18) 


where In denotes the n x n identity. The fully conditional 
posterior for B is simply the normal distribution following 
from ordinary least-squares regression, 

B|... ~X(p+i)„.(B,5). (19) 

The mean is 

B = (X Xj X Y, (20) 

whose calculation can be broken down into (p-l- 1) x {p+ 1) 
chunks due to the structure of X, and the covariance is 



Note that it is straightforward to sample from the prod¬ 
uct of Equation 19 and a normal prior for B; this option 
is implemented in lrgs, although the default is a uniform 
prior. 


3 . 1.4 Updating the intrinsic covariance 


With E = Y — XB (Equation 7), the conditional posterior 
for the intrinsic scatter is^^ 


SI... 


' W“^ {e'^E, n - 1 ) . 


( 22 ) 


In practice, a sample can be generated by setting S 
equal to , where A is (n — 1) x m and each row of 

0 , 


A is generated as Ai. ~ Afm 


3.2 Sampling the Dirichlet process model 

If a Dirichlet process rather than a Gaussian mixture is used 
to describe the prior distribution of covariates, the procedure 


This expression assumes a Jeffreys (i.e., minimally informa¬ 
tive) prior on E. More generally, one could use a prior S ~ 
yp—1(^ j/q), in which case the conditional posterior becomes 
E|... ~ >V“i (b'^B-I-T', IT' + ^ 0 ) ■ The Jeffreys prior corre¬ 
sponds to ^ = 0 and i^o = while ^ = 0 and i^o = ~(1 + 
corresponds to a prior that is uniform in |S| (see, e.g., Gelman 
et ah 2004). The K07 algorithm makes the latter assumption. The 
default in LRGS is the Jeffreys prior, but any inverse-Wishart prior 
can optionally be specified. 


© 0000 RAS, MNRAS 000, 000-000 
















6 A. B. Mantz 


to update the differs from that given above. This step 
implicitly updates G, which now identifies membership in 
one of a variable number of clusters (data points with iden¬ 
tical values of ^). In addition, Gibbs updates to the hyper¬ 
parameters of the Dirichlet process and its base distribution, 
K, fi and T, are possible. These are described below. Note 
that the updates to G, tt, {/.tfe}, {Tfe}, //o, U and W given 
in K07 are no longer applicable (of these, only G and tt exist 
in the model). 


3.2.1 Updating the covariates 


Let K be the number of clusters (i.e. distinct labels in G) 
at a given time. I follow the second algorithm given by Neal 
(2000), which first updates the cluster membership for each 
data point, and then draws new values of ^ for each cluster. 
For each data point i, update Gi as follows. Let 

Ik'’ = 4'’ K , k = l,2,...,K, (23) 

where Uk is the number of data points belonging to the fcth 
cluster not counting the ith data point, is the vector of co¬ 
variates shared by the fcth cluster, and Mv(x\pi, V) denotes 
the normal density, i.e. the density of evaluated 

at X. Here 

= [(Mi-i),,-h/3^S-i/3]“\ (24) 

^-(i) ^ f(i) [(Mr'zi). + - «)] ’ 


where Zi — (xi, yi — rji), and the subscript x indicates the 
range of subscripts associated with the covariates, 1, 2,. .. ,p 
(so that, e.g., [Mi~^]xx is the upper-left pxp block of ALT^). 
Furthermore, let 


Ji) _ 


= nAfp (^fj. 


^ rp 


(25) 


Each element of is the conditional probability associated 
with the covariates of the fcth cluster given the measurement 
and response variables associated with the ith data point, 
whereas is related to the probability of the ith data point 
being drawn instead from the base distribution of the Dirich¬ 
let process. A new label, Gi, is drawn from the multinomial 
distribution as 


Gi| ... ~ Multinom [(q^*\r^*')] , (26) 

after normalizing the probability vector (q^*\r^*^). A selec¬ 
tion Gi = A' + 1 indicates the creation of a new cluster, 
and in that case a new is immediately drawn from its 
conditional posterior. 


where 

rj^U) 


= fo« (^T^-LT-V) • 


(27) 


(28) 


Once the procedure above is completed, new covariate 
vectors are drawn for each cluster (fc = 1, 2,..., K) given 
the set of data points residing in it, 

^fc|... K{i2,f2), (29) 


fa = [(Mi-i),,-h/3^E-i/3] 

I i:Gi=k 

6 = fijr-V 

+ E [(Mi-izi)^-f/3TE-i(r,i-a)] 

i-.Gi^k 

and each new value is assigned to all in the corre¬ 
sponding cluster (i.e. with Gi = fc). 



3.2.2 Updating the Dirichlet process concentration 

The procedure for Gibbs sampling n is given by Escobar & 
West (1995). First, a latent variable, h, is introduced and 
sampled according to a Beta distribution, 

fc|... ~ Beta(K + 1, n). (30) 

Then, k is updated according to 


k\ ... ~ (5 Gamma [a + AT, 6 — ln(fc,)] (31) 

-|-(1 — 5)Gamma [a -|- A' — 1, fe — ln(/i)], 


where a and b are the shape and rate parameters of the 
Gamma prior on k, and 


5 = 


1 + n 


b — In(fc) 
a + K-1 


-1 


(32) 


In LRGS, the default values of a and b are chosen to be un¬ 
informative based on the number of data points, following 
the prescription given by Dorazio (2009). 


3.2.3 Updating the base distribution hyperparameters 

Using the notation of Section 3.2.1, the hyperparameters of 
the base distribution can be updated in turn as 


Ml--- 


T\... 



(33) 

ii'k - u){ik - m )^> K + p 


4 EXAMPLES 

This section provides two example applications of the meth¬ 
ods discussed above, respectively on a toy model and an 
astrophysical data set. 


4.1 Toy model 

Gonsider the case of a single covariate, generated by three 
distinct Gaussian components, and a single response vari¬ 
able. Table 2 shows the specific model parameters used to 
generate the data, and the toy data set is shown in the left 
panel of Figure 1. Because the Gaussians generating the co¬ 
variates are not especially well separated compared to their 
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Figure 1. Left: the simulated data used to fit the toy model in Section 4.1. Center: histogram of the simulated covariates, along with 
the 3-Gaussian-mixture distribution from which they are drawn. Right: the simulated data, with colors/symbols reflecting the cluster 
assignments of the Dirichlet model at one step in the fit. 


Table 2. Model parameters used to generate the toy data set in 
Section 4.1. The distribution of covariates is taken to be a mixture 
of 3 Gaussians. 


Parameter Value 


n, p, m, K 

100,1,1,3 

all Mi 

I 2 

a 

0 

d 

1 

s 

9 

all TTfc 

1/3 

Vk 

5(fe - 2) 

all Tfc 

1 


widths, the presence of three populations is not striking, al¬ 
though a histogram of the measured covariates is suggestive 
of the underlying structure (center panel of Figure 1). 

Suppose we had a physical basis for a 3-component 
model (or suspected 3 components, by inspection), but 
wanted to allow for the possibility of more or less struc¬ 
ture than a strict Gaussian mixture provides. The Dirichlet 
process supplies a way to do this. For a given k and n, the 
distribution of K is known,so in principle a prior expec¬ 
tation for the number of clusters, say 3 ± 1, can be roughly 
translated into a Gamma prior on k. Here I instead adopt an 
uninformative prior on n (Dorazio 2009), and compare the 
results to those of a Gaussian mixture model with K = 3. 

Using the Dirichlet process model, results from a chain 
of 1000 Gibbs samples (discarding the first 10) are shown as 
shaded histograms in Figure 1.^® The results are consistent 
with the input model values (vertical, dashed lines) for the 
parameters of interest (a, /3 and E). The latent parameters 
describing the base distribution of the Dirichlet process are 


Specifically, K\n,K ~ s(n, R”) k r(K)/r(K-|-n), where s is an 
unsigned Stirling number of the first kind (Antoniak 1974). 

For this particularly simple problem, the chain converges to 
the target distribution almost immediately. Gomparing the first 
and second halves of the chain (or multiple independent chains) 
the Gelman-Rubin R statistic is < 1.01 for every parameter. The 
autocorrelation length is also very short, < 10 steps for every 
parameter. 


also consistent with the toy model, although they are poorly 
constrained. The right panel of Figure 1 shows the cluster 
assignments for a sample with K = 6 (the median of the 
chain); the clustered nature of the data is recognized, al¬ 
though the number of clusters tends to exceed the number 
of components in the input model. An equivalent analy¬ 
sis using a mixture of 3 Gaussians rather than a Dirichlet 
process model produces very similar constraints on the pa¬ 
rameters of interest (hatched histograms in Figure 2). 

4.2 Scaling relations of relaxed galaxy clnsters 

As a real-life astrophysical example, I consider the scaling 
relations of dynamically relaxed galaxy clusters, using mea¬ 
surements presented by Mantz et al. (2015). Note that there 
are a number of subtleties in the interpretation of these re¬ 
sults that will be discussed elsewhere; here the problem is 
considered only as an application of the method presented 
in this work. 

Briefly, the data set comprises X-ray measurements of 
40 massive, relaxed clusters.^® The X-ray observables are 
total mass, M; gas mass, Mgas; average gas temperature, 
fcT; and luminosity, L. In addition, spectroscopically mea¬ 
sured redshifts are available for each cluster. A simple model 
of cluster formation by spherical collapse under gravity, ne¬ 
glecting gas physics, predicts self-similar power-law scaling 
relations among these quantities:^® 

Mgas oc M, (34) 

We should generically expect this, since it is entirely possible 
for a mixture of many Gaussians to closely resemble a mixture 
with fewer components; e.g., in Figure 2, we see that 2 of the 6 
clusters are populated by single data points that are not outliers. 
The reverse is not true, and here it is interesting that the Dirich¬ 
let process cannot fit the data using fewer than K = 3 clusters 
(Figure 2). 

Galaxy clusters, not to be confused with the clusters of data 
points arising in the sampling of the Dirichlet process. 

Here I take L to be measured in a soft X-ray band, in practice 
0.1-2.4keV. Since the emissivity in this band is weakly dependent 
on temperature for hot clusters such as those in the data set, the 
resulting scaling relation has a shallower dependence on mass than 
the more familiar bolometric luminosity-mass relation, Lbol ex 
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Figure 2. Histograms of parameter samples using a Dirichlet process (blue shaded) or 3-Gaussian mixture (hatched) prior on the 
distribution of covariates, for the toy-model analysis of Section 4.1. Dashed vertical lines indicate the input values used to generate the 
data set. The hyperparameters of the Gaussian mixture model are not shown; these correctly converge to the mean and width of the 
three mixture components. 


kT oc [E(z) , 

L oc 

where E{z) = H{z)/Ho is the normalized Hubble parameter 
at the cluster’s redshift. The aim of this analysis is to test 
whether the power-law slopes above are accurate, and to 
characterize the joint intrinsic scatter of Mgas, kT and L at 
hxed M and z. Taking the logarithm of these physical quan¬ 
tities, and assuming log-normal measurement errors and in- 

E{z) [E(z)M]'^^^. The exponents in the L scaling of Equation 34 
are specific to the chosen energy band. 


trinsic scatter, this becomes a linear regression with p = 2 
and m = 3. For brevity, and neglecting units, (In E, In M) —>■ 
{xi,X2) and (InMgas, InfcT,In L) — J- {yi,y2,y3)\ I also ap¬ 
proximately center the covariates for convenience. Figure 3 
shows summary plots of these data. Although measurement 
errors are shown as orthogonal bars for clarity, the analysis 
will use a full 5x5 covariance matrix accounting for the in¬ 
terdependence of the X-ray measurements (this covariance 
is illustrated for one cluster in the figure). 

Because the redshifts are measured with very small un¬ 
certainties, this problem is not well suited to the Dirich¬ 
let process prior; intuitively, the number of clusters in the 
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X1 



X2 




X1+X2 X1+X2 


Figure 3. Scatter plots showing the distribution of measured covariates and responses for the p = 2, m = 3 problem of fitting galaxy 
cluster scaling relations described in Section 4.2. An ellipse illustrates the measurement covariance for the most massive (largest X 2 ) 
cluster in each panel. The particular combinations of x and y plotted are conventional (cf. Equation 34). 


Diriclilet process must approach the number of data points 
because the data are strongly inconsistent with one another 
(i.e. are not exchangeable). Instead, I use a Gaussian mix¬ 
ture prior with K = 3, and verify that in practice the results 
are not sensitive to K (the parameters of interest differ neg¬ 
ligibly from an analysis with K = 1). 

Marginalized 2-dimensional constraints on the power- 
law slopes of each scaling relation are shown in the top row 
of Figure 4 (68.3 and 95.4 per cent confidence). On inspec¬ 
tion, only the luminosity scaling relation appears to be in 
any tension with the expectation in Equation 34, having a 
preference for a weaker dependence on E{z) and a stronger 
dependence on M. These conclusions are in good agreement 
with a variety of earlier work (e.g. Reiprich & Bohringer 
2002; Zhang et al. 2007, 2008; Mantz et al. 2010; Rykoff 
et al. 2008; Pratt et al. 2009; Vikhlinin et al. 2009; Leau- 
thaud et al. 2010; Reichert et al. 2011; Sereno & Ettori 2015; 
see also the review of Giodini et al. 2013). 

The posterior distributions of the elements of the multi¬ 
dimensional intrinsic covariance matrix are shown in the 
bottom row of Figure 4, after transforming to marginal scat¬ 
ter (square root of the diagonal) and correlation coefficients 
(for the off-diagonal elements). The intrinsic scatters of Mgas 
and kT at fixed M and z are in good agreement with other 
measurements in the literature (see Allen et al. 2011; Giodini 
et al. 2013, and references therein); the scatter of L is lower 
than the ~ 40 per cent typically found, likely because this 
analysis uses a special set of morphologically similar clusters 


rather than a more representative sample. The correlation 
coefficients are particularly challenging to measure, and the 
constraints are relatively poor. Nevertheless, the ability to 
efficiently place constraints on the full intrinsic covariance 
matrix is an important feature of this analysis. Within un¬ 
certainties, these results agree well with the few previous 
contraints on these correlation coefficients in the literature 
(Mantz et al. 2010; Maughan 2014). The best-fitting intrinsic 
covariance matrix is illustrated visually in Figure 5, which 
compares it to the residuals of y with respect to the best¬ 
fitting values of x and the best-fitting scaling relations. 


5 SUMMARY 

I have generalized the Bayesian linear regression method 
described by K07 to the case of multiple response variables, 
and included a Dirichlet process model of the distribution 
of covariates (equivalent to a Gaussian mixture whose com¬ 
plexity is learned from the data). The algorithm described 
here is implemented independently of the linmix_err IDL 
code of K07 as an R package called lrgs, which is publicly 
available. Two examples, respectively using a toy data set 
and real astrophysical data, are presented. 

A number of further generalizations are possible. In 
principle, significant complexity can be added to the model 
of the intrinsic scatter in the form of a Gaussian mixture or 
Dirichlet process model (with a Gaussian base distribution) 


© 0000 RAS, MNRAS 000, 000-000 


































10 A. B. Mantz 





Figure 4. Top row: joint 68.3 and 95.4 per cent confidence regions on the slope parameters of each of the scaling relations in Equation 34. 
Black circles indicate the self-similar expectation given in Equation 34. Bottom left; posterior distributions for the marginal intrinsic 
scatter parameters of the model. Bottom right: posteriors for the off-diagonal intrinsic scatter terms, expressed as correlation coefficients. 



Figure 5. Residuals of y with respect to the best-fitting values of x and the best-fitting scaling relations. For clarity, measurement errors 
are shown as orthogonal bars, even though the measurement covariances are non-trivial. In particular, the “by-eye” positive correlation 
in the left panel is due to a positive correlation in the measurement uncertainties. Shaded ellipses correspond to 1, 2 and 3 (t intrinsic 
scatter (in the 2 dimensions shown in each panel), according to the best-fitting intrinsic covariance matrix. 


while maintaining conjugacy of the conditional posteriors, 
and thereby the efficiency of the Gibbs sampler. The case 
censored data (upper limits on some measured responses) is 
discussed by K07. This situation, or, more generally, non- 
Gaussian measurement errors, can be handled by rejection 
sampling (at the expense of efficiency) but is not yet im¬ 
plemented in LRGS. Also of interest is the case of truncated 
data, where the selection of the data set depends on one of 


the response variables, and the data are consequently an in¬ 
complete and biased subset of a larger population. This case 
can in principle be handled by modeling the selection func¬ 
tion and imputing the missing data (Gelman et al. 2004). 
LRGS is shared publicly on GitHub, and I hope that users 
who want more functionality will be interested in helping 
develop the code further. 
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